NSF PAR Search | NSF Public Access Repository


HILTS: Human-LLM collaboration for effective data labeling

https://doi.org/10.1016/j.is.2025.102660

Barbosa, Juliana; Alencar, Eduarda; Fan, Grace; Santos, Aécio; Freire, Juliana (December 2025, Information Systems)

Free, publicly-accessible full text available December 1, 2026
Large Language Models for Data Discovery and Integration: Challenges and Opportunities

Freire, Juliana; Fan, Grace; Feuer, Benjamin; Koutras, Christos; Liu, Yurong; Pena, Eduardo; Santos, Aécio; Silva, Cláudio T; Wu, Eden (April 2025, IEEE Data Engineering Bulletin)

Free, publicly-accessible full text available April 3, 2026
Efficiently Estimating Mutual Information Between Attributes Across Tables

https://doi.org/10.1109/ICDE60146.2024.00022

Santos, Aécio; Korn, Flip; Freire, Juliana (May 2024, IEEE)

Full Text Available
Sampling Methods for Inner Product Sketching

https://doi.org/10.14778/3665844.3665850

Daliri, Majid; Freire, Juliana; Musco, Christopher; Santos, Aécio; Zhang, Haoxiang (May 2024, Proceedings of the VLDB Endowment)

Recently, Bessa et al. (PODS 2023) showed that sketches based on coordinated weighted sampling theoretically and empirically outperform popular linear sketching methods like Johnson-Lindenstrauss projection and CountSketch for the ubiquitous problem of inner product estimation. We further develop this finding by introducing and analyzing two alternative sampling-based methods. In contrast to the computationally expensive algorithm in Bessa et al., our methods run in linear time (to compute the sketch) and perform better in practice, significantly beating linear sketching on a variety of tasks. For example, they provide state-of-the-art results for estimating the correlation between columns in unjoined tables, a problem that we show how to reduce to inner product estimation in a black-box way. While based on known sampling techniques (threshold and priority sampling), we introduce significant new theoretical analysis to prove approximation guarantees for our methods.
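To make the idea of sampling-based inner product sketching concrete, here is a minimal sketch in the spirit of threshold sampling. It is an illustration under stated assumptions, not the paper's algorithm: the inclusion probability `min(1, m * v_i^2 / ||v||^2)`, the hash-based shared randomness, and all function names (`shared_uniform`, `threshold_sketch`, `threshold_inner_product`) are choices made for this example.

```python
import hashlib


def shared_uniform(seed, index):
    """Map (seed, index) to a reproducible pseudo-uniform value in [0, 1).

    Both vectors use the same value for the same index, which is what
    coordinates the two samples. (Hypothetical helper for this example.)
    """
    digest = hashlib.blake2b(f"{seed}:{index}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big") / 2**64


def threshold_sketch(vector, m, seed=0):
    """Sample a sparse vector given as {index: value}.

    Assumption: index i is kept with probability
    p_i = min(1, m * v_i^2 / ||v||^2), so the expected sketch size is at most m.
    Returns {index: (value, p_i)}.
    """
    squared_norm = sum(v * v for v in vector.values())
    sketch = {}
    for i, v in vector.items():
        if v == 0:
            continue
        p = min(1.0, m * v * v / squared_norm)
        if shared_uniform(seed, i) < p:
            sketch[i] = (v, p)
    return sketch


def threshold_inner_product(sketch_a, sketch_b):
    """Estimate <a, b> from two sketches built with the same seed.

    An index survives in both sketches with probability min(p_a, p_b),
    so dividing by that probability makes each term unbiased.
    """
    total = 0.0
    for i, (a_i, p_a) in sketch_a.items():
        if i in sketch_b:
            b_i, p_b = sketch_b[i]
            total += a_i * b_i / min(p_a, p_b)
    return total
```

Each sketch is computed independently in one pass over the non-zero entries. When `m` is large enough that every probability is 1, the estimate equals the exact inner product.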
Full Text Available
Simple Analysis of Priority Sampling

Daliri, Majid; Freire, Juliana; Musco, Christopher; Santos, Aécio; Zhang, Haoxiang (January 2024, SIAM Symposium on Simplicity in Algorithms)

We prove a tight upper bound on the variance of the priority sampling method (aka sequential Poisson sampling). Our proof is significantly shorter and simpler than the original proof given by Mario Szegedy at STOC 2006, which resolved a conjecture by Duffield, Lund, and Thorup.
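For readers unfamiliar with the method, the following is a minimal sketch of priority sampling as it is commonly described: each item gets priority `w_i / u_i` for a uniform `u_i`, the `k` highest priorities are kept, and each kept item is assigned the estimate `max(w_i, tau)`, where `tau` is the (k+1)-th highest priority. The function name `priority_sample` and its parameters are hypothetical, and the code does not reproduce the paper's analysis.

```python
import random


def priority_sample(weights, k, rng=None):
    """Return {index: estimated_weight} for a priority sample of size <= k.

    The sum of the returned estimates is an unbiased estimate of the
    total weight of all items.
    """
    if rng is None:
        rng = random.Random()
    priorities = []
    for i, w in enumerate(weights):
        if w <= 0:
            continue  # zero-weight items contribute nothing
        u = 1.0 - rng.random()  # uniform on (0, 1], avoids division by zero
        priorities.append((w / u, i))
    priorities.sort(reverse=True)
    # tau is the (k+1)-th highest priority; 0 if every item fits in the sample.
    tau = priorities[k][0] if len(priorities) > k else 0.0
    return {i: max(weights[i], tau) for _, i in priorities[:k]}
```

Subset sums are estimated by adding the estimates of the sampled items that fall in the subset. When `k` is at least the number of items, the sample is the whole input and every estimate is exact.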
Full Text Available
Weighted Minwise Hashing Beats Linear Sketching for Inner Product Estimation

https://doi.org/10.1145/3584372.3588679

Bessa, Aline; Daliri, Majid; Freire, Juliana; Musco, Cameron; Musco, Christopher; Santos, Aécio; Zhang, Haoxiang (June 2023, Proceedings of the 42nd ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems)

We present a new approach for independently computing compact sketches that can be used to approximate the inner product between pairs of high-dimensional vectors. Based on the Weighted MinHash algorithm, our approach admits strong accuracy guarantees that improve on the guarantees of popular linear sketching approaches for inner product estimation, such as CountSketch and Johnson-Lindenstrauss projection. Specifically, while our method exactly matches linear sketching for dense vectors, it yields significantly lower error for sparse vectors with limited overlap between non-zero entries. Such vectors arise in many applications involving sparse data, as well as in increasingly popular dataset search applications, where inner products are used to estimate data covariance, conditional means, and other quantities involving columns in unjoined tables. We complement our theoretical results by showing that our approach empirically outperforms existing linear sketches and unweighted hashing-based sketches for sparse vectors.
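As a point of reference for the linear sketching baselines named in the abstract, here is a minimal CountSketch-style estimator for inner products. It illustrates the baseline only, not the Weighted MinHash method of the paper; the hash construction and the names `_hash_int`, `count_sketch`, and `count_sketch_inner_product` are assumptions made for this example.

```python
import hashlib


def _hash_int(seed, tag, index):
    """Deterministic 64-bit hash of (seed, tag, index). Hypothetical helper."""
    digest = hashlib.blake2b(f"{seed}:{tag}:{index}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def count_sketch(vector, m, seed=0):
    """Compress a sparse vector {index: value} into m signed buckets.

    Each index is hashed to one bucket and one random sign. The map is
    linear, so sketch(a + b) == sketch(a) + sketch(b).
    """
    sketch = [0.0] * m
    for i, v in vector.items():
        bucket = _hash_int(seed, "bucket", i) % m
        sign = 1.0 if _hash_int(seed, "sign", i) % 2 == 0 else -1.0
        sketch[bucket] += sign * v
    return sketch


def count_sketch_inner_product(sketch_a, sketch_b):
    """Estimate <a, b> as the dot product of two sketches built with the same seed and m."""
    return sum(x * y for x, y in zip(sketch_a, sketch_b))
```

Entries that share an index always land in the same bucket with the same sign, so their product is preserved; error comes from distinct indices colliding in a bucket. For sparse vectors with little overlap, most of what the sketch stores is unrelated to the true inner product, which is the regime where the abstract reports that the sampling-based approach does better.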
Full Text Available
